This workshop aims to use the incredible quanteda
package to analyze the television series “How I Met Your Mother” and
demonstrate many of the quanteda package’s tools. We will
explore the characters, identify adjectives, render Wordclouds, network
plots and even sentiment analysis.
Plot: “Ted has fallen in love. It all started when his best friend, Marshall, drops the bombshell that he plans to propose to longtime girlfriend Lily, a kindergarten teacher. Suddenly, Ted realizes that he had better get a move on if he hopes to find true love. Helping him in the quest is Barney, a friend with endless – often outrageous – opinions, a penchant for suits and a foolproof way to meet women. When Ted meets Robin, he is sure it’s love at first sight, but the affair fizzles into friendship. Voice-over by Bob Saget (”Full House”) tells the story through flashbacks.”
Source: Rotten Tomatoes
![]()
Ted
Actor: Josh Radnor

Barney
Actor: Neil Patrick Harris

Robin
Actor: Cobin Smulders

Marshall
Actor: Jason Segel

Lily
Actor: Alyson Hannigan
“The story of five friends sitting in their favorite booth at MacLaren’s, their lives unfolding in front of each other, How I Met Your Mother is heartwarming and hilarious at the same time. Some believe that HIMYM is Ted’s story. Others think that it is Marshall and Lily’s story. And there’s a whole school of thought that it’s no one else but Barney’s story. We would like to think that it’s all of their stories because there won’t be a Ted without Barney or a Lily without Marshall, and definitely no Robin without a Ted (and Barney too). That’s how crucial each of the members of this group is, playing a major role in each other’s lives, helping them grow and become what they wanted to be.”
Source: Collider

These will be the libraries we will use for our analysis. In every line, you will find the purpose of it.
library(readtext) #For import and Handling for Plain and Formatted Text Files.
library(rvest) #For easily Harvest (Scrape) Web Pages.
library(xml2) #For working with XML files using a simple, consistent interface.
library(polite) #For be responsible when scraping data from websites.
library(httr) #Package for working with HTTP organised by HTTP verbs
library(tidyverse) #Opinionated collection of R packages designed for data science.
library(tidytext) #Functions and supporting data sets to allow conversion of text.
library(quanteda) #OUR PACKAGE for text analysis.
library(quanteda.textstats) #OUR SUBPACKAGE for text statistics.
library(quanteda.textplots) #OUR SUBPACKAGE for text plots.
library(stringr) #Consistent Wrappers for Common String Operations.
library(spacyr) #NLP package that comes from Python that help us classify words.
library(ggsci) #Collection of high-quality color palettes.
library(ggrepel) # ggrepel provides geoms for ggplot2 to repel overlapping text labels
library(RColorBrewer) #Beautifull color palettes.
library(cowplot) #Package to put images in our plots.
library(magick) #Package for save images in our environment
library(gghighlight) #gghighlight() adds direct labels for some geoms.
#Set image
obj_img <- image_read(path = "https://bit.ly/3twmH2Y")We will do a web scraping of our favorite TV show: “How I Met Your Mother.” For the above, we will do web scraping to obtain the scripts of the 208 episodes that the TV show has. We will define the URLs, obtain the information to know if we can do web scraping, and name the directory where we want to save our files.
v_tv_show <- "how-i-met-your-mother"
v_url_web <- "http://www.springfieldspringfield.co.uk/"
#Remember to be polite and know if we can web scrap the webpage
session_information <- bow(v_url_web) #Do a bow with the polite package
session_information
v_url <- paste(v_url_web,"episode_scripts.php?tv-show=", v_tv_show, sep="")
#Identify yourself
rvest_himym <- session(v_url,
add_headers(`From` = "jurjoo@gmail.com",
`UserAgent` = R.Version()$version.string))
#Start web scrap
html_url_scrape <- rvest_himym %>% read_html(v_url)
node_selector <- ".season-episode-title"
directory_path <- paste("texts/how-i-met-your-mother/", v_tv_show, sep = "")##🪡 Loop for download TV scripts
### 02.02.01.-scrape href nodes in .season-episode-title-------------------------
html_url_all_seasons <- html_nodes(html_url_scrape, node_selector) %>%
html_attr("href")
### 02.02.02.-One loop for all our URL's----------------------------------------
for (x in html_url_all_seasons) {
read_ur <- read_html(paste(v_url_web, x, sep="/"))
Sys.sleep(runif(1, 0, 1)) #Be polite
# Element node that was checked and that contain the place of the scripts.
selector <- ".scrolling-script-container"
# Scrape the text
text_html <- html_nodes(read_ur, selector) %>%
html_text()
# Last five characters of html_url_all_seasons for saving this to separate text files (This is our pattern).
sub_data <- function(x, n) {
substr(x, nchar(x) - n + 1, nchar(x))
}
seasons_final <- sub_data(x, 5)
# Write each text file
write.csv(text_html, file = paste(directory_path, "_", seasons_final, ".txt", sep=""), row.names = FALSE)
}url_himym <- "https://en.wikipedia.org/wiki/List_of_How_I_Met_Your_Mother_episodes"
rvest_himym_table <- session(url_himym,
add_headers(`From` = "jurjoo@gmail.com",
`UserAgent` = R.Version()$version.string))
l_tables_himym <- rvest_himym_table %>%
read_html() %>%
html_nodes("table") %>%
html_table(fill = TRUE)
#This generates a list with all the tables that contain the page. In our case,
#we want the table from the second element till the 10th.
l_tables_himym <- l_tables_himym[c(2:10)]
l_tables_himym[1]## [[1]]
## # A tibble: 22 × 8
## No.overall `No. inseason` Title Direc…¹ Writt…² Origi…³ Prod.…⁴ US vi…⁵
## <int> <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 1 "\"Pilot\"" Pamela… Carter… Septem… 1ALH79 10.94[…
## 2 2 2 "\"Purple … Pamela… Carter… Septem… 1ALH01 10.40[…
## 3 3 3 "\"Sweet T… Pamela… Phil L… Octobe… 1ALH02 10.44[…
## 4 4 4 "\"Return … Pamela… Kourtn… Octobe… 1ALH03 9.84[1…
## 5 5 5 "\"Okay Aw… Pamela… Chris … Octobe… 1ALH04 10.14[…
## 6 6 6 "\"Slutty … Pamela… Brenda… Octobe… 1ALH05 10.89[…
## 7 7 7 "\"Matchma… Pamela… Chris … Novemb… 1ALH07 10.55[…
## 8 8 8 "\"The Due… Pamela… Gloria… Novemb… 1ALH06 10.35[…
## 9 9 9 "\"Belly F… Pamela… Phil L… Novemb… 1ALH09 10.29[…
## 10 10 10 "\"The Pin… Pamela… Carter… Novemb… 1ALH08 12.27[…
## # … with 12 more rows, and abbreviated variable names ¹`Directed by`,
## # ²`Written by`, ³`Original air date`, ⁴Prod.code, ⁵`US viewers(millions)`
#Reduce the list in one data frame since all of the tables share the same structure
df_himym <- data.frame(Reduce(bind_rows, l_tables_himym))
#We do the same for the characters of HIMYM
url_himym_characters <- "https://en.wikipedia.org/wiki/List_of_How_I_Met_Your_Mother_characters"
rvest_himym_table_2 <- session(url_himym_characters,
add_headers(`From` = "jurjoo@gmail.com",
`UserAgent` = R.Version()$version.string))
l_tables_himym_characters <- rvest_himym_table_2 %>%
read_html() %>%
html_nodes("table") %>%
html_table(fill = TRUE)
df_characters <- as.data.frame(l_tables_himym_characters[[1]]) %>%
select(Character)
df_characters_w <- df_characters %>%
filter(!stringr::str_starts(Character, "Futu"),
!(Character %in% c("Character", "Main Characters",
"Supporting Characters"))) %>%
mutate(name = str_extract(Character,"([^ ]+)"),
name = replace(name, name == "Dr.", "Sonya"))
df_characters_w#We bind the tables with bind_rows
df_himym <- data.frame(Reduce(bind_rows, l_tables_himym))
df_himym_filt <- df_himym %>% filter(str_length(No.overall) < 4)
df_himym_filt_dupl <- df_himym %>% filter(str_length(No.overall) > 4)
#We are doing this particular wrangling to format in the best possible way our tables.
#Note that we are using stringr to manipulate our characters.
df_himym_filt_dupl_1 <- df_himym_filt_dupl %>%
mutate(No.overall = as.numeric(replace(No.overall, str_length(No.overall) > 4, substr(No.overall, 1, 3))),
No..inseason = as.numeric(replace(No..inseason, str_length(No..inseason) > 3, substr(No..inseason, 1, 2))),
Prod.code = replace (Prod.code, str_length(Prod.code) > 3, substr(Prod.code, 1, 6)))
df_himym_filt_dupl_2 <- df_himym_filt_dupl %>%
mutate(No.overall = as.numeric(replace(No.overall, str_length(No.overall) > 4, substr(No.overall, 4, 6))),
No..inseason = as.numeric(replace(No..inseason, str_length(No..inseason) > 3, substr(No..inseason, 3, 4))),
Title = replace(Title, Title == "\"The Magician's Code\"", "\"The Magician's Code Part 2\""),
Title = replace(Title, Title == "\"The Final Page\"", "\"The Final Page Part 2\""),
Title = replace(Title, Title == "\"Last Forever\"" , "\"Last Forever Part 2\"" ),
Prod.code = replace(Prod.code, str_length(Prod.code) > 3, substr(Prod.code, 7, 12)))
df_himym_final <- bind_rows(df_himym_filt,
df_himym_filt_dupl_1,
df_himym_filt_dupl_2) %>%
arrange(No.overall, No..inseason) %>%
mutate(year = str_extract(Original.air.date, '[0-9]{4}+'),
Season = as.numeric(stringr::str_extract(Prod.code, "^.{1}"))) %>%
rename(Chapter = No..inseason)
df_himym_final$US.viewers.millions. <- as.numeric(str_replace_all(df_himym_final$US.viewers.millions., "\\[[0-9]+\\]", ""))
df_himym_finaldf_texts_himym <- readtext::readtext("texts/how-i-met-your-mother/*.txt")
v_season <- as.numeric(stringr::str_extract(df_texts_himym$doc_id, "\\d+"))
v_chapter <- as.numeric(stringi::stri_extract_last_regex(df_texts_himym$doc_id, "[0-9]+"))
df_texts_himym_w <- df_texts_himym %>% mutate(Season = v_season, Chapter = v_chapter)
df_himym_final_doc <- full_join(as.data.frame(df_texts_himym_w), df_himym_final, by = c("Season", "Chapter")) %>%
mutate(Season_w = paste("Season", Season),
Title_season = paste0(Title, " S", Season, " EP", Chapter))
df_himym_final_docPress the arrows in the top right corner of this interactive dataframe. As you can see, we have our final dataframe with the information our our TV show, number of season, episode, etc.
Look our corpus, it’s divided into types, tokens and even sentences.
corp_himym <- corpus(df_himym_final_doc) #Build a new corpus from the texts
docnames(corp_himym) <- df_himym_final_doc$Title
summary(corp_himym, n = 15)Look our tokenization, we separate our text into words. Amazing!
corp_himym_stat <- corp_himym
docnames(corp_himym_stat) <- df_himym_final_doc$Title_season
corp_himym_s1_simil <- corpus_subset(corp_himym_stat, Season == 1) #We want to analyze just the first season
toks_himym_s1 <- tokens(corp_himym_s1_simil, #corpus from all the episodes from the first season
remove_punct = TRUE, #Remove punctuation of our texts
remove_separators = TRUE, #Remove separators of our texts
remove_numbers = TRUE, #Remove numbers of our texts
remove_symbols = TRUE) %>% #Remove symbols of our texts
tokens_remove(stopwords("english")) #Remove stop words of our texts
toks_himym_s1## Tokens consisting of 22 documents and 12 docvars.
## "Pilot" S1 EP1 :
## [1] "x" "OLDER" "TED" "Kids" "gonna"
## [6] "tell" "incredible" "story" "story" "met"
## [11] "mother" "punished"
## [ ... and 1,462 more ]
##
## "Purple Giraffe" S1 EP2 :
## [1] "x" "OLDER" "TED" "Okay" "telling"
## [6] "us" "met" "Mom" "excruciating" "detail"
## [11] "Right" "back"
## [ ... and 1,374 more ]
##
## "Sweet Taste of Liberty" S1 EP3 :
## [1] "x" "S" "Sy" "Syn" "Sync" "Sync" "b" "OLDER" "TED"
## [10] "one" "night" "met"
## [ ... and 1,350 more ]
##
## "Return of the Shirt" S1 EP4 :
## [1] "x" "OLDER" "TED"
## [4] "Kids" "single" "looking"
## [7] "happily-ever-after" "one" "stories"
## [10] "can" "end" "way"
## [ ... and 1,477 more ]
##
## "Okay Awesome" S1 EP5 :
## [1] "x" "OLDER" "TED" "kids" "like" "hear" "story" "time" "went"
## [10] "deaf" "even" "ask"
## [ ... and 1,138 more ]
##
## "Slutty Pumpkin" S1 EP6 :
## [1] "x" "OLDER" "TED" "know" "Aunt" "Robin's"
## [7] "big" "fan" "Halloween" "Always" "dressing" "crazy"
## [ ... and 1,405 more ]
##
## [ reached max_ndoc ... 16 more documents ]
Please, take a look into our Document Feature Matrix. Look know how it is counting our ocurreces. We can do multiple things with them.
toks_himym_dm_s1 <- toks_himym_s1 %>%
dfm() #Convert our tokens into a document feature matrix
toks_himym_dm_s1## Document-feature matrix of: 22 documents, 4,890 features (87.39% sparse) and 12 docvars.
## features
## docs x older ted kids gonna tell incredible story
## "Pilot" S1 EP1 1 1 22 3 22 6 1 7
## "Purple Giraffe" S1 EP2 1 5 30 2 19 1 1 1
## "Sweet Taste of Liberty" S1 EP3 1 3 27 1 15 6 0 2
## "Return of the Shirt" S1 EP4 1 5 14 1 15 4 0 6
## "Okay Awesome" S1 EP5 1 3 9 2 11 4 0 5
## "Slutty Pumpkin" S1 EP6 1 1 16 0 15 1 0 3
## features
## docs met mother
## "Pilot" S1 EP1 11 1
## "Purple Giraffe" S1 EP2 10 0
## "Sweet Taste of Liberty" S1 EP3 1 1
## "Return of the Shirt" S1 EP4 0 0
## "Okay Awesome" S1 EP5 0 0
## "Slutty Pumpkin" S1 EP6 4 0
## [ reached max_ndoc ... 16 more documents, reached max_nfeat ... 4,880 more features ]
textstat_simil function we will find the similarity between episodes
tstat_simil <- textstat_simil(toks_himym_dm_s1) #Check similarity between episodes of the first season
clust <- hclust(as.dist(tstat_simil)) #Convert our object into a cluster (For visualization purposes)
dclust <- as.dendrogram(clust) #Convert our cluster into a dendrogram (For visualization purposes)
dclust <- reorder(dclust, 1:22) #Order our visualization#Seetle colors
nodePar <- list(lab.cex = 1, pch = c(NA, 19),
cex.axis = 1.5,
cex = 2, col = "#0080ff")
par(mar = c(18.1, 6, 2, 3))
#Plot dendogram
plot(dclust, nodePar = nodePar,
las = 1,
cex.axis = 2, cex.main = 2, cex.sub = 2,
main = "How I Met Your Mother Season 1",
type = "triangle",
ylim = c(0,1),
ylab = "",
edgePar = list(col = 4:7, lwd = 7:7),
panel.first = abline(h = c(seq(.10, 1, .10)), col = "grey80"))
title(ylab = "Similarity between episodes (correlation %)", mgp = c(4, 1, 1), cex.lab = 2)
rect.hclust(clust, k = 5, border = "red")Look how amazing the similairty it is.

textstat_dist function. Here distance is the opposed of similarity. More distance equals less similar.
tstat_simil <- textstat_simil(toks_himym_dm_s1) #Check similarity between episodes of the first season
clust <- hclust(as.dist(tstat_simil)) #Convert our object into a cluster (For visualization purposes)
dclust <- as.dendrogram(clust) #Convert our cluster into a dendrogram (For visualization purposes)
dclust <- reorder(dclust, 1:22) #Order our visualizationpar(mar = c(21, 6, 2, 3))
#Plot dendogram
plot(dclust_dist, nodePar = nodePar_2,
las = 1,
cex.axis = 2, cex.main = 2, cex.sub = 2,
main = "How I Met Your Mother Season 1",
type = "triangle",
ylim = c(0, 120),
ylab = "",
edgePar = list(col = 11:19, lwd = 7:7),
panel.first = abline(h = c(seq(10, 120, 10)), col = "grey80"))
title(ylab = "Distance between episodes (correlation %)", mgp = c(4, 1, 1), cex.lab = 2)
rect.hclust(clust, k = 5, border = "red")
We will get the number of appearances by actor per season and episode.
#Remember our second step: tokenize our corpus.
toks_himym <- tokens(corp_himym, #corpus from all the episodes from the first season
remove_punct = TRUE, #Remove punctuation of our texts
remove_separators = TRUE, #Remove separators of our texts
remove_numbers = TRUE, #Remove numbers of our texts
remove_symbols = TRUE) %>% #Remove symbols of our texts
tokens_remove(stopwords("english")) #Add additional words
dfm_actors <- toks_himym %>%
tokens_select(c("Ted", "Marshall", "Lily", "Robin", "Barney", "Mother")) %>% #We just want to analyze these characters
tokens_group(groups = Season) %>% #We group our tokens (scripts) by season
dfm() #Transform the token into a DFM object
df_final_actors <- as.data.frame(textstat_frequency(dfm_actors, groups = c(1:9))) %>%
mutate(Season = paste("Season", group),
`Principal Characters` = replace(feature, is.character(feature), str_to_title(feature))) %>%
select(-feature)
df_final_actorsHere, we plot this frequency of actors
# Plot frequency of actors
ggplot1 <- ggplot(df_final_actors, aes(x = group, y = frequency, group = `Principal Characters`, color = `Principal Characters`)) +
geom_line(size = 1.5) +
scale_color_manual(values = brewer.pal(n = 6, name = "Dark2")) +
geom_point(size = 3.2) +
scale_y_continuous(breaks = seq(0, 5600, by = 50), limits = c(0,560))+
theme_minimal(base_size = 14) +
labs(x = "Number of Season",
y = "Frequencies of appreances",
title = "Appearances of principal characters by Season",
caption="Description: This plot show the number of times that the \n principal characters appears in HIMYM per season.")+
theme(panel.grid.major=element_line(colour="#cfe7f3"),
panel.grid.minor=element_line(colour="#cfe7f3"),
plot.title = element_text(margin = margin(t = 10, r = 20, b = 30, l = 30)),
#axis.text.x=element_text(size=15),
#axis.text.y=element_text(size=15),
plot.caption=element_text(size=12, hjust=.1, color="#939393"),
legend.position="bottom",
plot.margin = margin(t = 20, # Top margin
r = 50, # Right margin
b = 40, # Bottom margin
l = 10), # Left margin
text=element_text()) +
#geom_segment(aes(x = 8.5, y = 75, xend = 8.8, yend = 70),
# arrow = arrow(length = unit(0.1, "cm")))+
guides(colour = guide_legend(ncol = 6))
ggdraw(ggplot1) + draw_image(obj_img, x = .97, y = .97,
hjust = 1.1, vjust = .7,
width = 0.11, height = 0.1)
Wordcloud plots are super useful to realize how many words and the repetition of them in a text.
#Remember our second step: tokenize our corpus.
toks_himym_characters <- tokens(corp_himym, #corpus from all the episodes from all season
remove_punct = TRUE, #Remove punctuation of our texts
remove_separators = TRUE, #Remove separators of our texts
remove_numbers = TRUE, #Remove numbers of our texts
remove_symbols = TRUE) %>% #Remove symbols of our texts
tokens_keep(c(unique(df_characters_w$name))) #This function allow us to keep just the tokens that we want.
#In this case, we just want the characters.
toks_himym_characters## Tokens consisting of 208 documents and 12 docvars.
## "Pilot" :
## [1] "TED" "Marshall" "Marshall" "Ted" "Lily" "Lily"
## [7] "Marshall" "Marshall" "Barney" "Marshall" "Lily" "Marshall"
## [ ... and 58 more ]
##
## "Purple Giraffe" :
## [1] "TED" "Robin" "Barney" "Ted" "Ted" "Ted" "Robin" "Robin"
## [9] "Lily" "Ted" "Ted" "Lily"
## [ ... and 61 more ]
##
## "Sweet Taste of Liberty" :
## [1] "TED" "Marshall" "Lily" "Barney" "Robin" "Ted"
## [7] "TED" "Barney" "Ted" "Marshall" "Lily" "Marshall"
## [ ... and 54 more ]
##
## "Return of the Shirt" :
## [1] "TED" "TED" "Barney" "Robin" "Robin" "Ted"
## [7] "Ted" "Lily" "Lily" "Marshall" "Marshall" "Lily"
## [ ... and 18 more ]
##
## "Okay Awesome" :
## [1] "TED" "Robin" "Marshall" "Lily" "TED" "Lily"
## [7] "Lily" "Marshall" "Ted" "Ted" "Marshall" "Marshall"
## [ ... and 27 more ]
##
## "Slutty Pumpkin" :
## [1] "TED" "Lily" "Robin" "Robin" "Ted" "Ted" "Ted" "TED" "TED"
## [10] "Ted" "LILY" "Ted"
## [ ... and 24 more ]
##
## [ reached max_ndoc ... 202 more documents ]
#Remember our third step: DFM object
dfm_general_characters <- toks_himym_characters %>%
dfm()
#In this case, we just want the characters.
toks_himym_characters## Tokens consisting of 208 documents and 12 docvars.
## "Pilot" :
## [1] "TED" "Marshall" "Marshall" "Ted" "Lily" "Lily"
## [7] "Marshall" "Marshall" "Barney" "Marshall" "Lily" "Marshall"
## [ ... and 58 more ]
##
## "Purple Giraffe" :
## [1] "TED" "Robin" "Barney" "Ted" "Ted" "Ted" "Robin" "Robin"
## [9] "Lily" "Ted" "Ted" "Lily"
## [ ... and 61 more ]
##
## "Sweet Taste of Liberty" :
## [1] "TED" "Marshall" "Lily" "Barney" "Robin" "Ted"
## [7] "TED" "Barney" "Ted" "Marshall" "Lily" "Marshall"
## [ ... and 54 more ]
##
## "Return of the Shirt" :
## [1] "TED" "TED" "Barney" "Robin" "Robin" "Ted"
## [7] "Ted" "Lily" "Lily" "Marshall" "Marshall" "Lily"
## [ ... and 18 more ]
##
## "Okay Awesome" :
## [1] "TED" "Robin" "Marshall" "Lily" "TED" "Lily"
## [7] "Lily" "Marshall" "Ted" "Ted" "Marshall" "Marshall"
## [ ... and 27 more ]
##
## "Slutty Pumpkin" :
## [1] "TED" "Lily" "Robin" "Robin" "Ted" "Ted" "Ted" "TED" "TED"
## [10] "Ted" "LILY" "Ted"
## [ ... and 24 more ]
##
## [ reached max_ndoc ... 202 more documents ]
textplot_wordcloud(dfm_general_characters,
rotation = 0.25,
min_size = 1.4, max_size = 8,
min_count = 1, #Minimum frequency
color = brewer.pal(11, "RdBu"))
#RColorBrewer::display.brewer.all()
Now, we do the same, but with our seconday characters.
#Remember our second step: tokenize our corpus.
toks_himym_sec_characters <- tokens(corp_himym, #corpus from all the episodes from all season
remove_punct = TRUE, #Remove punctuation of our texts
remove_separators = TRUE, #Remove separators of our texts
remove_numbers = TRUE, #Remove numbers of our texts
remove_symbols = TRUE) %>% #Remove symbols of our texts
tokens_keep(c(unique(df_characters_w$name))) %>% #We want to keep all the characters
tokens_remove(c("Ted", "Barney", "Lily", "Robin", "Marshall")) #But we remove the principal characters#Remember our third step: DFM object
dfm_general_sec_characters <- toks_himym_sec_characters %>%
dfm()textplot_wordcloud(dfm_general_sec_characters,
random_order = FALSE,
rotation = 0.25,
min_size = 1, max_size =5,
labelsize = 1.5,
min_count = 1, #Minimum frequency
color = RColorBrewer::brewer.pal(8, "Spectral"))
spacyr provides a convenient R wrapper around the Python spaCy package. It offers easy access to the following functionality of spaCy. This package is amazing because here what spacyr is doing is clasifying automatically our words into nouns, adjectives, verbs, dates and much more.
Of course, it is not 100% accurate, but it is an amazing tool to do some analysis!
![]()
library(spacyr)
spacy_install()
spacy_initialize(model = "en_core_web_sm")
sp_parse_doc <- spacy_parse(df_himym_final_doc, tag=TRUE)sp_parse_docsp_parse_var <- full_join(sp_parse_doc, df_himym_final_doc, by = c("doc_id"))
#In this case, we will just look the proper names and adjectives.
sp_parse_var_PROPN <- sp_parse_var %>% filter(pos=="PROPN" & stringr::str_starts(entity, "PERSON_B"))
sp_parse_var_ADJ <- sp_parse_var %>% filter(pos=="ADJ")We will get a wordcloud using the spacYr output. We will divide for this examples into adjectives and other features. Please, checl the package, it’s amazing.
#Remember our second step: tokenize our corpus.
toks_himym_ADJ <- tokens(corp_himym, #corpus from all the episodes from all season
remove_punct = TRUE, #Remove punctuation of our texts
remove_separators = TRUE, #Remove separators of our texts
remove_numbers = TRUE, #Remove numbers of our texts
remove_symbols = TRUE) %>% #Remove symbols of our texts
tokens_keep(c(unique(sp_parse_var_ADJ$lemma))) %>% #We want to keep all the adjective
tokens_remove(c(stopwords("english"), "oh", "yeah", "okay", "like",
"get", "got", "can", "one", "hey", "go",
"Ted", "Marshall", "Lily", "Robin", "Barney", "just",
"know", "well", "right", "even", "see",
"sure", "back", "first", "said", "maybe", "wedding",
"whole", "wait")) #But we remove stopwords and other words that the package didn't classify it correctly. #Remember our third step: DFM object
df_general_ADJ <- toks_himym_ADJ %>%
tokens_group(groups = Season_w) %>% #group by season
dfm() %>% dfm_subset(Season < 9)Look how amazing are the adjectives distributed into the 8 seasons. Unfortunately the function only allows us 8 groups.
textplot_wordcloud(df_general_ADJ,
random_order = FALSE,
rotation = 0.25,
comparison = TRUE,
labelsize = 1.5,
min_count = 1, #Minimum frequency
color = ggsci::pal_lancet(palette = "lanonc"))
We will get a frequency of adjectives using the spacYr output. We repeat, we can do amazing things in terms of analysis.
#Remember our second step: tokenize our corpus.
freq_gen_dfm <- toks_himym_ADJ %>%
dfm()#Generate dataframe
df_freq_gen_dfm <- as.data.frame(textstat_frequency(freq_gen_dfm, # Our DFM object
n = 10, #Number of observations displayed
groups = Season)) #Grouped by season
df_freq_gen_dfm_match <- df_freq_gen_dfm %>% mutate(total = 1) %>%
group_by(feature) %>%
summarise(total = sum(total)) %>%
filter(total== 9)
df_freq_gen_dfm_final <- right_join(df_freq_gen_dfm, df_freq_gen_dfm_match,
by = "feature") %>% rename(Word = feature) %>%
mutate(Word = str_to_title(Word))Look the frequency of adjectives. It’s amazing how the word sorry appears and tends to be the one that our beautiful characters keep using.
ggplot2 <- ggplot(df_freq_gen_dfm_final, aes(x = group, y = frequency, group = Word, color = Word)) +
geom_line(size = 1.5, show.legend = TRUE) +
scale_color_manual(values = rev(brewer.pal(n = 7, name = "Dark2"))) +
geom_point(size = 3.2) +
theme_minimal(base_size = 14) +
labs(x = "Number of Season",
y = "Frequencies of words",
title = "Frequency of adjectives",
caption="Description: This plot shows the top adjectives that appears in every season of HIMYM")+
theme(panel.grid.major=element_line(colour="#cfe7f3"),
panel.grid.minor=element_line(colour="#cfe7f3"),
plot.title = element_text(margin = margin(t = 10, r = 20, b = 30, l = 30)),
#axis.text.x=element_text(size=15),
#axis.text.y=element_text(size=15),
plot.caption=element_text(size=12, hjust=.1, color="#939393"),
legend.position="bottom",
plot.margin = margin(t = 20, # Top margin
r = 50, # Right margin
b = 40, # Bottom margin
l = 10), # Left margin
text=element_text()) +
#geom_segment(aes(x = 8.5, y = 75, xend = 8.8, yend = 70),
# arrow = arrow(length = unit(0.1, "cm")))+
guides(colour = guide_legend(ncol = 4)) +
gghighlight(max(frequency) > 140,
keep_scales = TRUE,
unhighlighted_params = list(colour = NULL, alpha = 0.2))
ggdraw(ggplot2) + draw_image(obj_img, x = .97, y = .97,
hjust = 1.1, vjust = .7,
width = 0.11, height = 0.1)
How the characters are related each other? We will find it with the amazing function network plot.
#Remember our second step: tokenize our corpus.
token_characters_himym <- tokens(corp_himym, #corpus from all the episodes from all season
remove_punct = TRUE, #Remove punctuation of our texts
remove_separators = TRUE, #Remove separators of our texts
remove_numbers = TRUE, #Remove numbers of our texts
remove_symbols = TRUE) %>% #Remove symbols of our texts
tokens_keep(c(unique(df_characters_w$name))) %>% #We want to keep all the characters
tokens_tolower() #We want lower cases in our tokens#Extra step: create a feature co-ocurrence matrix (FCM)
fcm_characters_himym <- token_characters_himym %>%
fcm(context = "window", window = 5, tri = FALSE)#Vector with all the characters
v_top_characters <- stringr::str_to_sentence(names(topfeatures(fcm_characters_himym, 70)))
set.seed(100)
textplot_network(fcm_select(fcm_characters_himym, v_top_characters),
edge_color = "#008eed",
edge_size = 2,
vertex_labelcolor = "#006fba",
omit_isolated = TRUE,
min_freq = .1)
If we want to be more specific, then we can reduce our network plot into just 30 characters.
#Vector with 30 characters
v_top_characters_2 <- stringr::str_to_sentence(names(topfeatures(fcm_characters_himym, 30)))
textplot_network(fcm_select(fcm_characters_himym, v_top_characters_2),
edge_color = "#008eed",
edge_size = 5,
vertex_labelcolor = "#006fba",
omit_isolated = TRUE,
min_freq = .1)
If we want to be even more specific, we can even reduce our network plot and weight it with just one character. In this case, Ted.
fcm_characters_himym_ted <- token_characters_himym %>%
tokens_remove(c("marshall", "lily", "barney", "robin")) %>% #Here we just want ted, that why we remove the other principal characters
fcm(context = "window", window = 5, tri = FALSE)
#Vector with 30 characters
v_top_characters_3 <- stringr::str_to_sentence(names(topfeatures(fcm_characters_himym_ted, 30)))
#Create a FCM matrix with our characters
vertex_size_f <- fcm_select(fcm_characters_himym_ted, pattern = v_top_characters_3)
#Create a proportion
v_proportion <- rowSums(vertex_size_f)/min(rowSums(vertex_size_f))
#Vector of Ted
x_p <- c("ted")
#Replace that proportion in our vector
final_v <- replace(v_proportion, names(v_proportion) %in% x_p,
v_proportion[names(v_proportion) %in% x_p]/15)textplot_network(fcm_select(fcm_characters_himym_ted, v_top_characters_3),
edge_color = "#008eed",
edge_size = 5,
vertex_labelcolor = "#006fba",
omit_isolated = TRUE,
vertex_labelsize = final_v,
min_freq = .1)
##🎹 Text stat collocation
Identify and score multi-word expressions, or adjacent fixed-length collocations, from text using textstat_collocations().
We want to see which phrases tend to be the more used ones.
#Remember our second step: tokenize our corpus.
toks_himym_s1 <- tokens(corp_himym_s1_simil, #Define our corpus for the first season
padding = TRUE) %>% #Leave an empty string where the removed tokens previously existed
tokens_remove(stopwords("english")) #Remove stopwords of our tokenhimym_s1_collocations <-textstat_collocations(toks_himym_s1, #Our token object
tolower = F) #Keep capital letters
df_himym_s1_coll <- data.frame(himym_s1_collocations) %>%
rename(`Total of collocations` = count)Good! look how collocations like right now, get married, party number are the most used ones in the first season.
ggplot3 <- ggplot(df_himym_s1_coll, aes(x = z, y = lambda, label = collocation)) +
geom_point(alpha = 0.2, aes(size = `Total of collocations`), color = "#00578a")+
geom_point(data = df_himym_s1_coll %>% filter(z > 15),
aes(x = z, y = lambda, size = `Total of collocations`),
color = '#00578a') +
geom_text_repel(data = df_himym_s1_coll %>% filter(z > 15), #Function from ggrepel package. Show scatterplots with text.
aes(label = collocation, size = count), size = 3,
box.padding = unit(0.35, "lines"),
point.padding = unit(0.3, "lines")) +
scale_y_continuous(breaks = seq(0, 16, by = 1), limits = c(0,16))+
theme_minimal(base_size = 14) +
labs(x = "Z statistic",
y = "Lambda",
title = "Allocations of words in the First Season",
caption = "Description: This plot identifies and scores multi-word expressions of the 1st season")+
theme(panel.grid.major = element_line(colour = "#cfe7f3"),
panel.grid.minor = element_line(colour = "#cfe7f3"),
plot.title = element_text(margin = margin(t = 10, r = 20, b = 30, l = 30)),
#axis.text.x=element_text(size=15),
#axis.text.y=element_text(size=15),
plot.caption = element_text(size=12, hjust=.1, color="#939393"),
legend.position="bottom",
plot.margin = margin(t = 20, # Top margin
r = 50, # Right margin
b = 10, # Bottom margin
l = 10))
ggdraw(ggplot3) + draw_image(obj_img, x = .97, y = .97,
hjust = 1.1, vjust = .7,
width = 0.11, height = 0.1)
What about now to look the most iconics phrases in HIMYM. It’s going to be…wait for it…legendary. We can do that to with locate keywords in context.
#Set dataframe to merge with other information--------------------------
df_title_s_chp <- df_himym_final_doc %>%
select(Title, Season, Chapter, No.overall,
Season_w, US.viewers.millions.)
#First step: Define a corpus --------------------------------------
corp_himym <- corpus(df_himym_final_doc) # build a new corpus from the texts
docnames(corp_himym) <- df_himym_final_doc$Title #Rename docnames with Title of the episode
corp_himym_s5 <- corpus_subset(corp_himym, #our corpus
Season == 5) #Filter by seasonAn example with the word: Love
toks_himym_s5 <- tokens(corp_himym_s5, #Corpus of season 5
padding = TRUE)kw_himym_s5_love <- kwic(toks_himym_s5, #token object.
pattern = "love*", #pattern that we want to look for.
window = 10) #how many words you want before and after your pattern.df_kw_himym_s5_love <- as.data.frame(kw_himym_s5_love) %>%
rename(Title = docname,`Pre Sentence` = pre, `Post Sentence` = post)%>%
rename_with(str_to_title, .cols = everything()) %>% left_join(df_title_s_chp,
by ="Title") %>%
relocate(Title, Season, Chapter)
df_kw_himym_s5_loveThat’s amazing: it’s seems that the word love appears 150 times just in the fifth season.
An example with the word: legendary
toks_himym <- tokens(corp_himym, #Define our corpus for all seasons
padding = TRUE) #Leave an empty string where the removed tokens previously existedkw_himym_legendary <- kwic(toks_himym, #token object.
pattern = "legendary*", #pattern that we want to look for.
window = 10) #how many words you want before and after your pattern.df_kw_himym_legendary <- as.data.frame(kw_himym_legendary) %>%
rename(Title = docname,`Pre Sentence` = pre, `Post Sentence` = post)%>%
rename_with(str_to_title, .cols = everything()) %>% left_join(df_title_s_chp,
by = "Title") %>%
relocate(Title, Season, Chapter)
df_kw_himym_legendaryMmmhhh, we tought that the word legendary was going to appeared more. Maybe they didn’t mention that so often.
We can even do phrases like: Wait for it. But don’t worry, you don’t need to wait us. We are here.
kw_himym_wait_for <- kwic(toks_himym, #token object.
pattern = phrase("wait for it"), #Here we can specify even a phrase
window = 10) #how many words you want before and after your pattern.df_kw_himym_wait_for <- as.data.frame(kw_himym_wait_for) %>%
rename(Title = docname,`Pre Sentence` = pre, `Post Sentence` = post)%>%
rename_with(str_to_title, .cols = everything()) %>% left_join(df_title_s_chp,
by = "Title") %>%
relocate(Title, Season, Chapter)
df_kw_himym_wait_forEXTRA: just because we were having a lot of fun with this package. We are going to do a quick sentiment analysis.
toks_himym <- tokens(corp_himym, #Our corpus object
remove_punct = TRUE, #Remove punctuation in our texts
remove_separators = TRUE, #Remove separators in our texts
remove_numbers = TRUE, #Remove numbers in our texts
remove_symbols = TRUE) %>% #Remove symbols in our texts
tokens_remove(stopwords("english"))#Add additional words
#tidy_sou <- df_himym_final_doc %>%
# unnest_tokens(word, text) This is another way on spacyrWe will use the get_sentiments functions to get positive and negative words. We have four sources. We are going to use bing, but you can choose the one that you like the most.
df_positive_words <- get_sentiments("bing") %>% #We have four options: "bing", "afinn", "loughran", "nrc"
filter(sentiment == "positive")
df_negative_words <- get_sentiments("bing") %>%
filter(sentiment == "negative")We must define a dictionary to put it into a dictionary and pass it thorugh a dfm object. We know that you are an expert on that now.
#Define a dictionary with positive and negative words from bing --------------------------------------
l_sentiment_dictionary <- dictionary(list(positive = df_positive_words,
negative = df_negative_words))💡 Warning: this functions takes 30 minutes: be patience. Don’t worry, we will charge the dataframe for you.
dfm_sentiment_himym <- dfm(toks_himym) %>% dfm_lookup(dictionary = sentiment_dictionary)We will charge the document for you. We got you.
##Load a file
#It is a DFM object, which comes from a token off all the season of HIMYM
load(file = "data/dfm_sentiment_himym.Rdata")
#Rename doc:id with the Titles of every episode
docnames(dfm_sentiment_himym) <- df_himym_final_doc$TitleWe will give a format to our dataframe.
#Format in long to plot positive and negative words
df_sentiment_himym <- convert(dfm_sentiment_himym, "data.frame") %>%
gather(positive.word, negative.word, key = "Polarity", value = "Words") %>%
rename(Title = doc_id) %>%
mutate(Title = as_factor(Title)) %>%
left_join(df_title_s_chp, by ="Title") %>%
mutate(Polarity = replace(Polarity, is.character(Polarity),
str_replace_all(Polarity,
pattern = "negative.word",
replacement = "Negative words")),
Polarity = replace(Polarity, is.character(Polarity),
str_replace_all(Polarity,
pattern = "positive.word",
replacement = "Positive words")))
ggplot4 <- ggplot(df_sentiment_himym, aes(x = Chapter, y = Words, fill = Polarity, group = Polarity)) +
geom_bar(stat = 'identity', position = position_dodge(), size = 1) +
facet_wrap(~ Season_w)+
scale_fill_manual(values = c("#c6006f", "#004383")) +
scale_y_continuous(breaks = seq(0, 250, by = 50))+
theme_minimal(base_size = 14) +
labs(x = "Episodes",
y = "Frequency of words",
title = "Total of positive and negative words per season",
caption="Description: This plot identifies total of positive and negative words \n per season and episode")+
theme(panel.grid.major = element_line(colour="#cfe7f3"),
panel.grid.minor = element_line(colour="#cfe7f3"),
plot.title = element_text(margin = margin(t = 10, r = 20, b = 30, l = 30)),
#axis.text.x=element_text(size=15),
#axis.text.y=element_text(size=15),
plot.caption = element_text(size = 12, hjust = .1, color = "#939393"),
legend.position = "bottom",
plot.margin = margin(t = 20, # Top margin
r = 50, # Right margin
b = 10, # Bottom margin
l = 10))
ggdraw(ggplot4) + draw_image(obj_img, x = .97, y = .97,
hjust = 1.1, vjust = .7,
width = 0.11, height = 0.1)
Look the total (raw) words between positive and negative words per season
dfm_weight() We can be more fair. Let’s know calculate the weight of the words.
This step is the same as the last one, but here we are taking into account the weights to do a fair comparison
dfm_sentiment_himym_prop <- dfm_weight(dfm_sentiment_himym, scheme = "prop")
dfm_sentiment_himym_prop## Document-feature matrix of: 208 documents, 4 features (47.72% sparse) and 11 docvars.
## features
## docs positive.word positive.sentiment negative.word
## "Pilot" 0.6621005 0 0.3378995
## "Purple Giraffe" 0.6722222 0 0.3277778
## "Sweet Taste of Liberty" 0.6510417 0 0.3489583
## "Return of the Shirt" 0.5977011 0 0.4022989
## "Okay Awesome" 0.6257310 0 0.3742690
## "Slutty Pumpkin" 0.6267281 0 0.3732719
## features
## docs negative.sentiment
## "Pilot" 0
## "Purple Giraffe" 0
## "Sweet Taste of Liberty" 0
## "Return of the Shirt" 0
## "Okay Awesome" 0
## "Slutty Pumpkin" 0
## [ reached max_ndoc ... 202 more documents ]
It seems that HIMYM is positive after all. Amazing.
df_sentiment_himym_prop <- convert(dfm_sentiment_himym_prop, "data.frame") %>%
gather(positive.word, negative.word, key = "Polarity", value = "Words") %>%
rename(Title = doc_id) %>%
mutate(Title = as_factor(Title)) %>%
left_join(df_title_s_chp, by = "Title") %>%
mutate(Polarity = replace(Polarity, is.character(Polarity),
str_replace_all(Polarity,
pattern = "negative.word",
replacement = "Negative words")),
Polarity = replace(Polarity, is.character(Polarity),
str_replace_all(Polarity,
pattern = "positive.word",
replacement = "Positive words")))
### 14.07.02.- Plot total of positive and negative words per season and episode -----
#This step is the same as the last one, but here we are taking into account the weights to do a fair comparison
ggplot5 <- ggplot(df_sentiment_himym_prop, aes(x = Chapter, y = Words, fill = Polarity, group = Polarity)) +
geom_bar(stat = 'identity', position = position_dodge(), size = 1) +
facet_wrap(~ Season_w) +
scale_fill_manual(values = c("#c6006f", "#004383")) +
scale_y_continuous(breaks = seq(0, .8, by = .2))+
theme_minimal(base_size = 14) +
labs(x = "Episodes",
y = "Frequency of words",
title = "Weighted positve and negative words per season",
caption = "Description: This plot identifies the weighted total of positive and negative words \n per season and episode")+
theme(panel.grid.major = element_line(colour = "#cfe7f3"),
panel.grid.minor = element_line(colour = "#cfe7f3"),
plot.title = element_text(margin = margin(t = 10, r = 20, b = 30, l = 30)),
#axis.text.x=element_text(size=15),
#axis.text.y=element_text(size=15),
plot.caption = element_text(size = 12, hjust = .1, color = "#939393"),
legend.position = "bottom",
plot.margin = margin(t = 20, # Top margin
r = 50, # Right margin
b = 10, # Bottom margin
l = 10))
ggdraw(ggplot5) + draw_image(obj_img, x = .97, y = .97,
hjust = 1.1, vjust = .7,
width = 0.11, height = 0.1)
Now let’s do a rate to check in which episodes it can be more a negative context. We will use Scaling Policy Preferences from Coded Political Texts from WILL LOWE, KENNETH BENOIT, SLAVA MIKHAYLOV, MICHAEL LAVER.
They use a balance between positive words/negative words using a log scale, which you can see on the code.
#Here we
df_sentiment_himym_prop_measure <- convert(dfm_sentiment_himym_prop, "data.frame") %>%
rename(Sentiment = positive.word) %>% rename(Title = doc_id) %>%
left_join(df_title_s_chp, by = "Title") %>%
mutate(measure = log((Sentiment + 0.5)/(negative.word + .5))) %>%
select(-Season) %>%
rename(Season = Season_w)dfm_sentiment_himym_prop <- dfm_weight(dfm_sentiment_himym, scheme = "prop")
dfm_sentiment_himym_prop## Document-feature matrix of: 208 documents, 4 features (47.72% sparse) and 11 docvars.
## features
## docs positive.word positive.sentiment negative.word
## "Pilot" 0.6621005 0 0.3378995
## "Purple Giraffe" 0.6722222 0 0.3277778
## "Sweet Taste of Liberty" 0.6510417 0 0.3489583
## "Return of the Shirt" 0.5977011 0 0.4022989
## "Okay Awesome" 0.6257310 0 0.3742690
## "Slutty Pumpkin" 0.6267281 0 0.3732719
## features
## docs negative.sentiment
## "Pilot" 0
## "Purple Giraffe" 0
## "Sweet Taste of Liberty" 0
## "Return of the Shirt" 0
## "Okay Awesome" 0
## "Slutty Pumpkin" 0
## [ reached max_ndoc ... 202 more documents ]
Plot measure of positivity among season
Woooow! We confirm that is a positive series, but it’s interesting how certain episodes, mostly from the last season, have a negative context. This total makes sense because by that time Lily was fighting with Marshall for their baby and Robin, Ted and Barney were with problems.
df_sentiment_himym_prop <- convert(dfm_sentiment_himym_prop, "data.frame") %>%
gather(positive.word, negative.word, key = "Polarity", value = "Words") %>%
rename(Title = doc_id) %>%
mutate(Title = as_factor(Title)) %>%
left_join(df_title_s_chp, by = "Title") %>%
mutate(Polarity = replace(Polarity, is.character(Polarity),
str_replace_all(Polarity,
pattern = "negative.word",
replacement = "Negative words")),
Polarity = replace(Polarity, is.character(Polarity),
str_replace_all(Polarity,
pattern = "positive.word",
replacement = "Positive words")))
ggplot6 <- ggplot(df_sentiment_himym_prop_measure, aes(x = No.overall, y = measure,
color = Season, group = Season)) +
scale_color_manual(values = brewer.pal(n = 9, name = "Set1"))+
geom_line(size = 1.5) +
geom_point(size = 3.2) +
scale_x_continuous(breaks = seq(0, 208, by = 20))+
theme_minimal(base_size = 14) +
labs(x = "Number of episode",
y = "Rate",
title = "Measure of positivity among episodes",
caption="Description: This plot shows the positivity rate of every episode")+
theme(panel.grid.major = element_line(colour = "#cfe7f3"),
panel.grid.minor = element_line(colour = "#cfe7f3"),
plot.title = element_text(margin = margin(t = 10, r = 20, b = 30, l = 30)),
plot.caption = element_text(size=12, hjust = .1, color = "#939393"),
legend.position = "bottom",
plot.margin = margin(t = 20, # Top margin
r = 50, # Right margin
b = 40, # Bottom margin
l = 10), # Left margin
text = element_text()) +
guides(colour = guide_legend(ncol = 3)) +
geom_hline(yintercept = 0, linetype = "dashed",
color = "red", size = 1)
ggdraw(ggplot6) + draw_image(obj_img, x = .97, y = .97,
hjust = 1.1, vjust = .7,
width = 0.11, height = 0.1)
A work by Jorge Roa, Augusto Fonseca & Alexander KRaess
Prepared for Intro to Data Science Workshop 2022